Supervised and semi-supervised statistical models for word-based sentiment analysis
Author
Abstract
Ever since its inception, sentiment analysis has relied heavily on methods that use words as their basic unit. Even today, such methods deliver top performance. This way of representing data for sentiment analysis is known as the clue model. It offers practical advantages over more sophisticated approaches: it is easy to implement, and statistical models can be trained efficiently even on large datasets. However, the clue model also has notable shortcomings. First, clues are highly redundant across examples, and thus training based on annotated data is potentially inefficient. Second, clues are treated context-insensitively, i.e., the sentiment expressed by a clue is assumed to be the same regardless of context. In this thesis, we address these shortcomings. We propose two approaches to reduce redundancy: First, we use active learning, a method for automatic data selection guided by the statistical model to be trained. We show that active learning can speed up the training process for document classification significantly, reducing clue redundancy. Second, we present a graph-based approach that uses annotated clue types rather than annotated documents which contain clue instances. We show that using a random-walk model, we can train a highly accurate document classifier. We next investigate the context-dependency of clues. We first introduce sentiment relevance, a novel concept that aims at identifying content that contributes to the overall sentiment of the review. We show that even when we have no annotated sentiment relevance data available, a high-accuracy sentiment relevance classifier can be trained using transfer learning and distant supervision. Second, we perform a linguistically motivated analysis and simplification of a compositional sentiment analysis model. We find that the model captures linguistic structures poorly. Further, it can be simplified without any loss of accuracy.
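The active-learning idea mentioned in the abstract can be illustrated with a minimal sketch. It is not the thesis's actual method: it assumes a toy word-based (clue) classifier with smoothed log-odds weights and a simple uncertainty-sampling rule, and all names (`train`, `score`, `select_most_uncertain`, `active_learning`, `oracle`) are hypothetical.

```python
import math
from collections import Counter


def train(labeled):
    """Fit per-word log-odds weights from (text, label) pairs; label is +1 or -1."""
    pos, neg = Counter(), Counter()
    for text, label in labeled:
        (pos if label > 0 else neg).update(text.lower().split())
    vocab = set(pos) | set(neg)
    # Add-one smoothing keeps the log-odds finite for words seen in one class only.
    pos_total = sum(pos.values()) + len(vocab)
    neg_total = sum(neg.values()) + len(vocab)
    return {
        w: math.log((pos[w] + 1) / pos_total) - math.log((neg[w] + 1) / neg_total)
        for w in vocab
    }


def score(weights, text):
    """Sum the clue weights; the sign is the predicted polarity, context is ignored."""
    return sum(weights.get(w, 0.0) for w in text.lower().split())


def select_most_uncertain(weights, pool):
    """Uncertainty sampling: pick the pool document whose score is closest to zero."""
    return min(pool, key=lambda text: abs(score(weights, text)))


def active_learning(seed, pool, oracle, rounds):
    """Repeatedly retrain, query the most uncertain document, and add its label."""
    labeled, pool = list(seed), list(pool)
    for _ in range(min(rounds, len(pool))):
        weights = train(labeled)
        query = select_most_uncertain(weights, pool)
        pool.remove(query)
        labeled.append((query, oracle(query)))  # oracle stands in for a human annotator
    return train(labeled), labeled
```

In this sketch, a document made only of already-known clues gets a confident score and is skipped, while a document with unseen words scores near zero and is sent for annotation. That is one simple way in which model-guided selection can avoid spending labels on redundant clues.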
Similar resources
A Supervised Method for Constructing Sentiment Lexicon in Persian Language
Due to the growth of digital content on the internet and social media, sentiment analysis is one of the emerging fields. This problem, which deals with information extraction and knowledge discovery from textual data using natural language processing, has attracted the attention of many researchers. Construction of a sentiment lexicon as a valuable language resource is one of the imp...
A Semi-Supervised Framework Based on a Self-Constructed Adaptive Lexicon for Persian Opinion Analysis
With the appearance of Web 2.0 and 3.0, users’ contributions to the WWW have created a huge amount of valuable expressed opinions. Given the difficulty or impossibility of manually analyzing such big data, sentiment analysis, as a branch of natural language processing, has received considerable attention. Unlike for other (popular) languages, a limited number of research studies have been conducted in ...
Semi-supervised Convolutional Neural Networks for Text Categorization via Region Embedding
This paper presents a new semi-supervised framework with convolutional neural networks (CNNs) for text categorization. Unlike the previous approaches that rely on word embeddings, our method learns embeddings of small text regions from unlabeled data for integration into a supervised CNN. The proposed scheme for embedding learning is based on the idea of two-view semi-supervised learning, which...
Semi-Supervised Learning with Multi-View Embedding: Theory and Application with Convolutional Neural Networks
This paper presents a theoretical analysis of multi-view embedding – feature embedding that can be learned from unlabeled data through the task of predicting one view from another. We prove its usefulness in supervised learning under certain conditions. The result explains the effectiveness of some existing methods such as word embedding. Based on this theory, we propose a new semi-supervised l...
Semi-Supervised Affective Meaning Lexicon Expansion Using Semantic and Distributed Word Representations
In this paper, we propose an extension to graph-based sentiment lexicon induction methods by incorporating distributed and semantic word representations in building the similarity graph to expand a three-dimensional sentiment lexicon. We also implemented and evaluated the label propagation using four different word representations and similarity metrics. Our comprehensive evaluation of the four ...